16. Review: Dropout
Dropout and Momentum
The next solution will show a different (improved) model for clothing classification. It has two main differences when compared to the first solution:
- It has an additional dropout layer
- Its optimizer, stochastic gradient descent (SGD), includes a momentum term
So, why are these considered improvements?
Dropout
Dropout randomly turns off perceptrons (nodes) that make up the layers of our network, with some specified probability. It may seem counterintuitive to throw away connections in our network, but as a network trains, some nodes can come to dominate others or end up making large mistakes. Dropout gives us a way to balance the network so that every node works equally toward the same goal, and if one node makes a mistake, it won't dominate the behavior of the model. You can think of dropout as a technique that makes a network resilient; it makes all the nodes work well as a team by ensuring that no node is too weak or too strong. In fact, it reminds me of the Chaos Monkey tool that is used to test for system/website failures.
I encourage you to look at the PyTorch dropout documentation to see how to add these layers to a network.
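As a rough sketch of what this can look like, here is a small classifier with a dropout layer added between fully-connected layers. The layer sizes and the dropout probability (p=0.2) are illustrative choices, not values from the solution notebook:

```python
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)
        # randomly zero 20% of fc1's activations during training
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, x):
        x = x.view(x.shape[0], -1)               # flatten the input image
        x = self.dropout(F.relu(self.fc1(x)))    # apply dropout after the hidden layer
        return F.log_softmax(self.fc2(x), dim=1)
```

Remember that dropout is only active in training mode; calling model.eval() turns it off when you validate or test, and model.train() turns it back on.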
For a recap of what dropout does, check out Luis's video below.
Dropout
Momentum
When you train a network, you specify an optimizer that aims to reduce the errors your network makes during training. Those errors should generally decrease over time, but there may be some bumps along the way. Gradient descent optimization can settle into a local minimum of the error, and it has trouble reaching the global minimum, which is the lowest the error can get. So, we add a momentum term to help us push past local minima and keep moving toward the global minimum!
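In PyTorch, momentum is a single argument to the SGD optimizer. The learning rate and momentum values below are illustrative, not the ones used in the solution:

```python
import torch.optim as optim

# stochastic gradient descent with a momentum term
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```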
Check out the video below for a review of how momentum works, mathematically.
Momentum